Group 3 for MSDS 7331 Lab One: Visualization and Data Preprocessing

Analysis of the Online News Popularity dataset (https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity): explore statistical summaries of the features, visualize the attributes, and draw conclusions from the visualizations and analysis

01. Business Understanding (10)

  • Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). Describe how you would define and measure the outcomes from the dataset. That is, why is this data important and how do you know if you have mined useful knowledge from the dataset? How would you measure the effectiveness of a good prediction algorithm? Be specific *

This Online News Popularity dataset was acquired from Mashable (https://mashable.com) on 01/08/2015. The goal is to predict the number of shares an article receives in social networks, a proxy for its popularity.
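One concrete way to measure the effectiveness of a prediction algorithm on this target is to score it on the log scale (share counts are heavily right-skewed) and, for practical usefulness, as a binary popular/unpopular call at the median share count. A minimal sketch — the function name and the toy share counts below are ours, not part of the dataset:

```python
import numpy as np

def evaluate_share_predictions(y_true, y_pred):
    """Score a shares predictor two ways: RMSE on log-shares
    (robust to the heavy right tail) and accuracy of a binary
    popular/unpopular call at the median share count."""
    log_true, log_pred = np.log1p(y_true), np.log1p(y_pred)
    rmse = float(np.sqrt(np.mean((log_true - log_pred) ** 2)))
    threshold = np.median(y_true)
    acc = float(np.mean((y_true >= threshold) == (y_pred >= threshold)))
    return rmse, acc

# Toy illustration with made-up share counts
rmse, acc = evaluate_share_predictions(
    np.array([100.0, 1400.0, 5000.0, 800.0]),
    np.array([150.0, 1300.0, 4000.0, 900.0]))
```

A model has mined useful knowledge when its log-scale RMSE beats a naive predict-the-median baseline and the median-split accuracy sits well above 50%.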

Attribute Information:

02. Data Mining Type (10)

  • Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file.

    Number of Attributes: 61 (58 predictive attributes, 2 non-predictive, 1 goal field)

  • url: URL of the article (non-predictive)

  • timedelta: Days between the article publication and the dataset acquisition (non-predictive)
  • n_tokens_title: Number of words in the title
  • n_tokens_content: Number of words in the content
  • n_unique_tokens: Rate of unique words in the content
  • n_non_stop_words: Rate of non-stop words in the content
  • n_non_stop_unique_tokens: Rate of unique non-stop words in the content
  • num_hrefs: Number of links
  • num_self_hrefs: Number of links to other articles published by Mashable
  • num_imgs: Number of images
  • num_videos: Number of videos
  • average_token_length: Average length of the words in the content
  • num_keywords: Number of keywords in the metadata
  • data_channel_is_lifestyle: Is data channel 'Lifestyle'?
  • data_channel_is_entertainment: Is data channel 'Entertainment'?
  • data_channel_is_bus: Is data channel 'Business'?
  • data_channel_is_socmed: Is data channel 'Social Media'?
  • data_channel_is_tech: Is data channel 'Tech'?
  • data_channel_is_world: Is data channel 'World'?
  • kw_min_min: Worst keyword (min. shares)
  • kw_max_min: Worst keyword (max. shares)
  • kw_avg_min: Worst keyword (avg. shares)
  • kw_min_max: Best keyword (min. shares)
  • kw_max_max: Best keyword (max. shares)
  • kw_avg_max: Best keyword (avg. shares)
  • kw_min_avg: Avg. keyword (min. shares)
  • kw_max_avg: Avg. keyword (max. shares)
  • kw_avg_avg: Avg. keyword (avg. shares)
  • self_reference_min_shares: Min. shares of referenced articles in Mashable
  • self_reference_max_shares: Max. shares of referenced articles in Mashable
  • self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
  • weekday_is_monday: Was the article published on a Monday?
  • weekday_is_tuesday: Was the article published on a Tuesday?
  • weekday_is_wednesday: Was the article published on a Wednesday?
  • weekday_is_thursday: Was the article published on a Thursday?
  • weekday_is_friday: Was the article published on a Friday?
  • weekday_is_saturday: Was the article published on a Saturday?
  • weekday_is_sunday: Was the article published on a Sunday?
  • is_weekend: Was the article published on the weekend?
  • LDA_00: Closeness to LDA topic 0
  • LDA_01: Closeness to LDA topic 1
  • LDA_02: Closeness to LDA topic 2
  • LDA_03: Closeness to LDA topic 3
  • LDA_04: Closeness to LDA topic 4
  • global_subjectivity: Text subjectivity
  • global_sentiment_polarity: Text sentiment polarity
  • global_rate_positive_words: Rate of positive words in the content
  • global_rate_negative_words: Rate of negative words in the content
  • rate_positive_words: Rate of positive words among non-neutral tokens
  • rate_negative_words: Rate of negative words among non-neutral tokens
  • avg_positive_polarity: Avg. polarity of positive words
  • min_positive_polarity: Min. polarity of positive words
  • max_positive_polarity: Max. polarity of positive words
  • avg_negative_polarity: Avg. polarity of negative words
  • min_negative_polarity: Min. polarity of negative words
  • max_negative_polarity: Max. polarity of negative words
  • title_subjectivity: Title subjectivity
  • title_sentiment_polarity: Title polarity
  • abs_title_subjectivity: Absolute subjectivity level
  • abs_title_sentiment_polarity: Absolute polarity level
  • shares: Number of shares (target) *


03. Data Quality (15)

  • Verify data quality: Explain any missing values, duplicate data, and outliers. Are those mistakes? How do you deal with these problems? Give justifications for your methods. *
In [1]:
# Import the libraries used throughout the Lab_01 project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
# Read csv file
df = pd.read_csv('/Users/shanqinggu/Desktop/OnlineNewsPopularity.csv')
df.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39644 entries, 0 to 39643
Data columns (total 61 columns):
url                               39644 non-null object
 timedelta                        39644 non-null float64
 n_tokens_title                   39644 non-null float64
 n_tokens_content                 39644 non-null float64
 n_unique_tokens                  39644 non-null float64
 n_non_stop_words                 39644 non-null float64
 n_non_stop_unique_tokens         39644 non-null float64
 num_hrefs                        39644 non-null float64
 num_self_hrefs                   39644 non-null float64
 num_imgs                         39644 non-null float64
 num_videos                       39644 non-null float64
 average_token_length             39644 non-null float64
 num_keywords                     39644 non-null float64
 data_channel_is_lifestyle        39644 non-null float64
 data_channel_is_entertainment    39644 non-null float64
 data_channel_is_bus              39644 non-null float64
 data_channel_is_socmed           39644 non-null float64
 data_channel_is_tech             39644 non-null float64
 data_channel_is_world            39644 non-null float64
 kw_min_min                       39644 non-null float64
 kw_max_min                       39644 non-null float64
 kw_avg_min                       39644 non-null float64
 kw_min_max                       39644 non-null float64
 kw_max_max                       39644 non-null float64
 kw_avg_max                       39644 non-null float64
 kw_min_avg                       39644 non-null float64
 kw_max_avg                       39644 non-null float64
 kw_avg_avg                       39644 non-null float64
 self_reference_min_shares        39644 non-null float64
 self_reference_max_shares        39644 non-null float64
 self_reference_avg_sharess       39644 non-null float64
 weekday_is_monday                39644 non-null float64
 weekday_is_tuesday               39644 non-null float64
 weekday_is_wednesday             39644 non-null float64
 weekday_is_thursday              39644 non-null float64
 weekday_is_friday                39644 non-null float64
 weekday_is_saturday              39644 non-null float64
 weekday_is_sunday                39644 non-null float64
 is_weekend                       39644 non-null float64
 LDA_00                           39644 non-null float64
 LDA_01                           39644 non-null float64
 LDA_02                           39644 non-null float64
 LDA_03                           39644 non-null float64
 LDA_04                           39644 non-null float64
 global_subjectivity              39644 non-null float64
 global_sentiment_polarity        39644 non-null float64
 global_rate_positive_words       39644 non-null float64
 global_rate_negative_words       39644 non-null float64
 rate_positive_words              39644 non-null float64
 rate_negative_words              39644 non-null float64
 avg_positive_polarity            39644 non-null float64
 min_positive_polarity            39644 non-null float64
 max_positive_polarity            39644 non-null float64
 avg_negative_polarity            39644 non-null float64
 min_negative_polarity            39644 non-null float64
 max_negative_polarity            39644 non-null float64
 title_subjectivity               39644 non-null float64
 title_sentiment_polarity         39644 non-null float64
 abs_title_subjectivity           39644 non-null float64
 abs_title_sentiment_polarity     39644 non-null float64
 shares                           39644 non-null int64
dtypes: float64(59), int64(1), object(1)
memory usage: 18.5+ MB
In [3]:
# Exclude url and timedelta columns, read from n_tokens_title

df = df.loc[:, ' n_tokens_title':]
df.head() # use df.tail() to read from the bottom
Out[3]:
n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length ... min_positive_polarity max_positive_polarity avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares
0 12.0 219.0 0.663594 1.0 0.815385 4.0 2.0 1.0 0.0 4.680365 ... 0.100000 0.7 -0.350000 -0.600 -0.200000 0.500000 -0.187500 0.000000 0.187500 593
1 9.0 255.0 0.604743 1.0 0.791946 3.0 1.0 1.0 0.0 4.913725 ... 0.033333 0.7 -0.118750 -0.125 -0.100000 0.000000 0.000000 0.500000 0.000000 711
2 9.0 211.0 0.575130 1.0 0.663866 3.0 1.0 1.0 0.0 4.393365 ... 0.100000 1.0 -0.466667 -0.800 -0.133333 0.000000 0.000000 0.500000 0.000000 1500
3 9.0 531.0 0.503788 1.0 0.665635 9.0 0.0 1.0 0.0 4.404896 ... 0.136364 0.8 -0.369697 -0.600 -0.166667 0.000000 0.000000 0.500000 0.000000 1200
4 13.0 1072.0 0.415646 1.0 0.540890 19.0 19.0 20.0 0.0 4.682836 ... 0.033333 1.0 -0.220192 -0.500 -0.050000 0.454545 0.136364 0.045455 0.136364 505

5 rows × 59 columns

In [4]:
# Build a 'Channel' category from the one-hot data_channel_is_* columns.
# Slicing with .copy() gives each subset its own frame, so the column
# assignments below no longer raise SettingWithCopyWarning.

Lifestyle_df = df[df[' data_channel_is_lifestyle'] == 1].copy()
Lifestyle_df[' Channel'] = 'Lifestyle'

Entertainment_df = df[df[' data_channel_is_entertainment'] == 1].copy()
Entertainment_df[' Channel'] = 'Entertainment'

Bus_df = df[df[' data_channel_is_bus'] == 1].copy()
Bus_df[' Channel'] = 'Bus'

Socmed_df = df[df[' data_channel_is_socmed'] == 1].copy()
Socmed_df[' Channel'] = 'Socmedia'

Tech_df = df[df[' data_channel_is_tech'] == 1].copy()
Tech_df[' Channel'] = 'Tech'

World_df = df[df[' data_channel_is_world'] == 1].copy()
World_df[' Channel'] = 'World'

# Note: articles assigned to no channel are dropped by this concat.
df = pd.concat([Lifestyle_df, Entertainment_df, Bus_df, Socmed_df, Tech_df, World_df], axis=0)
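An alternative to the slice-and-concat pattern is to derive the category in a single step with np.select. A sketch on a toy frame — the toy data is ours; the real frame would pass all six data_channel_is_* columns:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring two of the one-hot channel columns
toy = pd.DataFrame({
    ' data_channel_is_lifestyle': [1, 0, 0],
    ' data_channel_is_bus':       [0, 1, 0],
})
channel_cols = [' data_channel_is_lifestyle', ' data_channel_is_bus']
labels = ['Lifestyle', 'Bus']

# np.select picks the label of the first matching one-hot column;
# rows belonging to no channel get '' and are dropped, matching the
# concat-based approach above.
toy[' Channel'] = np.select([toy[c] == 1 for c in channel_cols],
                            labels, default='')
toy = toy[toy[' Channel'] != '']
```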
In [5]:
# Build a 'weekday' category from the one-hot weekday_is_* columns,
# again using .copy() to avoid SettingWithCopyWarning.

Monday_df = df[df[' weekday_is_monday'] == 1].copy()
Monday_df[' weekday'] = 'Monday'

Tuesday_df = df[df[' weekday_is_tuesday'] == 1].copy()
Tuesday_df[' weekday'] = 'Tuesday'

Wednesday_df = df[df[' weekday_is_wednesday'] == 1].copy()
Wednesday_df[' weekday'] = 'Wednesday'

Thursday_df = df[df[' weekday_is_thursday'] == 1].copy()
Thursday_df[' weekday'] = 'Thursday'

Friday_df = df[df[' weekday_is_friday'] == 1].copy()
Friday_df[' weekday'] = 'Friday'

Saturday_df = df[df[' weekday_is_saturday'] == 1].copy()
Saturday_df[' weekday'] = 'Saturday'

Sunday_df = df[df[' weekday_is_sunday'] == 1].copy()
Sunday_df[' weekday'] = 'Sunday'

# Every article has exactly one weekday flag set, so no rows are lost here.
df = pd.concat([Monday_df, Tuesday_df, Wednesday_df, Thursday_df, Friday_df, Saturday_df, Sunday_df], axis=0)
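Since every article has exactly one weekday flag set, the one-hot block can also be reversed directly with idxmax. A sketch on a toy frame (the toy data is ours; the real frame would pass all seven weekday_is_* columns):

```python
import pandas as pd

# Toy frame mirroring two of the one-hot weekday columns
toy = pd.DataFrame({
    ' weekday_is_monday':  [1, 0],
    ' weekday_is_tuesday': [0, 1],
})
weekday_cols = [' weekday_is_monday', ' weekday_is_tuesday']

# idxmax returns the column name of the single 1 in each row; strip
# the prefix to turn it into a readable category.
toy[' weekday'] = (toy[weekday_cols].idxmax(axis=1)
                   .str.replace(' weekday_is_', '', regex=False)
                   .str.capitalize())
```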
In [6]:
# Check column locations of the one-hot blocks before dropping them.
# Only the value of the last expression (36, for ' is_weekend') is
# echoed by the notebook.

df.columns.get_loc(' data_channel_is_lifestyle')
df.columns.get_loc(' data_channel_is_world')
df.columns.get_loc(' weekday_is_monday')
df.columns.get_loc(' is_weekend')
Out[6]:
36
In [7]:
df.columns[[11, 12, 13, 14, 15, 16, 29, 30, 31,32, 33, 34, 35, 36 ]]
Out[7]:
Index([' data_channel_is_lifestyle', ' data_channel_is_entertainment',
       ' data_channel_is_bus', ' data_channel_is_socmed',
       ' data_channel_is_tech', ' data_channel_is_world', ' weekday_is_monday',
       ' weekday_is_tuesday', ' weekday_is_wednesday', ' weekday_is_thursday',
       ' weekday_is_friday', ' weekday_is_saturday', ' weekday_is_sunday',
       ' is_weekend'],
      dtype='object')
In [8]:
# Drop the one-hot channel and weekday columns, now encoded in ' Channel' and ' weekday'

df.drop(df.columns[[11, 12, 13, 14, 15, 16, 29, 30, 31,32, 33, 34, 35, 36 ]], axis=1, inplace=True)
In [9]:
# Verify there are no missing values in the dataset
pd.isnull(df).sum()
Out[9]:
 n_tokens_title                  0
 n_tokens_content                0
 n_unique_tokens                 0
 n_non_stop_words                0
 n_non_stop_unique_tokens        0
 num_hrefs                       0
 num_self_hrefs                  0
 num_imgs                        0
 num_videos                      0
 average_token_length            0
 num_keywords                    0
 kw_min_min                      0
 kw_max_min                      0
 kw_avg_min                      0
 kw_min_max                      0
 kw_max_max                      0
 kw_avg_max                      0
 kw_min_avg                      0
 kw_max_avg                      0
 kw_avg_avg                      0
 self_reference_min_shares       0
 self_reference_max_shares       0
 self_reference_avg_sharess      0
 LDA_00                          0
 LDA_01                          0
 LDA_02                          0
 LDA_03                          0
 LDA_04                          0
 global_subjectivity             0
 global_sentiment_polarity       0
 global_rate_positive_words      0
 global_rate_negative_words      0
 rate_positive_words             0
 rate_negative_words             0
 avg_positive_polarity           0
 min_positive_polarity           0
 max_positive_polarity           0
 avg_negative_polarity           0
 min_negative_polarity           0
 max_negative_polarity           0
 title_subjectivity              0
 title_sentiment_polarity        0
 abs_title_subjectivity          0
 abs_title_sentiment_polarity    0
 shares                          0
 Channel                         0
 weekday                         0
dtype: int64
In [10]:
# Verify there are no duplicated rows (the empty result below confirms none)

df[df.duplicated(keep=False)]
Out[10]:
n_tokens_title n_tokens_content n_unique_tokens n_non_stop_words n_non_stop_unique_tokens num_hrefs num_self_hrefs num_imgs num_videos average_token_length ... avg_negative_polarity min_negative_polarity max_negative_polarity title_subjectivity title_sentiment_polarity abs_title_subjectivity abs_title_sentiment_polarity shares Channel weekday

0 rows × 47 columns

In [ ]:
# Since the sample size exceeds 30k rows, outliers are handled by log transformation rather than removal
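The effect of a log transform on a long right tail can be checked directly via sample skewness. A sketch on simulated lognormal "shares"-like counts (the data here is synthetic, not drawn from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated right-skewed share counts (lognormal draws)
shares = pd.Series(np.expm1(rng.normal(7.0, 1.0, 5000))).round()

raw_skew = shares.skew()
log_skew = np.log(shares + 0.1).skew()  # same +0.1 offset as the transforms below
# The log compresses the right tail, pulling skewness toward zero.
```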
In [11]:
# Find which variables need a log transform; the heuristic flags columns
# whose max exceeds 10x the median (a long right tail) and is greater than 1

df_T = df.describe().T

df_T["log"] = (df_T["max"] > df_T["50%"] * 10) & (df_T["max"] > 1)
df_T["log+2"] = df_T["log"] & (df_T["min"] < 0)

df_T["scale"] = ""

df_T.loc[df_T["log"],"scale"] = "log"
df_T.loc[df_T["log+2"],"scale"] = "log+2"

df_T[["mean", "min", "50%", "max", "scale"]]
Out[11]:
mean min 50% max scale
n_tokens_title 10.416204 2.000000 10.000000 23.000000
n_tokens_content 585.438317 0.000000 447.000000 8474.000000 log
n_unique_tokens 0.549702 0.000000 0.532596 701.000000 log
n_non_stop_words 1.015010 0.000000 1.000000 1042.000000 log
n_non_stop_unique_tokens 0.699134 0.000000 0.690411 650.000000 log
num_hrefs 10.380603 0.000000 7.000000 304.000000 log
num_self_hrefs 3.368517 0.000000 3.000000 116.000000 log
num_imgs 3.959445 0.000000 1.000000 128.000000 log
num_videos 0.998448 0.000000 0.000000 75.000000 log
average_token_length 4.607736 0.000000 4.669471 7.695652
num_keywords 7.177798 1.000000 7.000000 10.000000
kw_min_min 25.539421 -1.000000 -1.000000 377.000000 log+2
kw_max_min 1114.712389 0.000000 656.000000 298400.000000 log
kw_avg_min 309.083875 -1.000000 237.316667 42827.857143 log+2
kw_min_max 12207.393972 0.000000 1300.000000 843300.000000 log
kw_max_max 753337.156073 0.000000 843300.000000 843300.000000
kw_avg_max 241773.119564 0.000000 228286.666666 843300.000000
kw_min_avg 1031.643292 -1.000000 956.466667 3610.124972
kw_max_avg 5161.325150 0.000000 4044.559329 298400.000000 log
kw_avg_avg 2890.423023 0.000000 2737.047950 43567.659946 log
self_reference_min_shares 3565.717849 0.000000 1100.000000 690400.000000 log
self_reference_max_shares 9594.667448 0.000000 2700.000000 837700.000000 log
self_reference_avg_sharess 5820.979727 0.000000 2100.000000 690400.000000 log
LDA_00 0.207364 0.000000 0.040000 0.926994
LDA_01 0.141303 0.000000 0.033339 0.925947
LDA_02 0.245294 0.000000 0.050001 0.919999
LDA_03 0.140653 0.000000 0.033373 0.925542
LDA_04 0.265356 0.000000 0.050763 0.927191
global_subjectivity 0.439569 0.000000 0.447111 1.000000
global_sentiment_polarity 0.119596 -0.377657 0.119719 0.727841
global_rate_positive_words 0.039891 0.000000 0.039164 0.155488
global_rate_negative_words 0.016320 0.000000 0.015209 0.139831
rate_positive_words 0.695364 0.000000 0.714286 1.000000
rate_negative_words 0.288462 0.000000 0.277778 1.000000
avg_positive_polarity 0.351704 0.000000 0.354545 1.000000
min_positive_polarity 0.091221 0.000000 0.100000 1.000000
max_positive_polarity 0.762561 0.000000 0.800000 1.000000
avg_negative_polarity -0.255202 -1.000000 -0.250000 0.000000
min_negative_polarity -0.524040 -1.000000 -0.500000 0.000000
max_negative_polarity -0.104041 -1.000000 -0.100000 0.000000
title_subjectivity 0.265740 0.000000 0.100000 1.000000
title_sentiment_polarity 0.068661 -1.000000 0.000000 1.000000
abs_title_subjectivity 0.343692 0.000000 0.500000 0.500000
abs_title_sentiment_polarity 0.145570 0.000000 0.000000 1.000000
shares 2928.637989 1.000000 1400.000000 690400.000000 log
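The scale heuristic used in the cell above can be factored into a small helper and sanity-checked on toy columns (the helper name and toy frame are ours):

```python
import pandas as pd

def needs_log(desc_row):
    """Mirror the heuristic above: flag a column for log scaling when
    its max dwarfs its median (max > 10 * median) and exceeds 1."""
    return bool(desc_row['max'] > desc_row['50%'] * 10) and bool(desc_row['max'] > 1)

# 'x' has a long right tail; 'y' does not
toy = pd.DataFrame({'x': [1, 2, 3, 1000], 'y': [0.1, 0.2, 0.3, 0.4]})
flags = toy.describe().T.apply(needs_log, axis=1)
```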
In [12]:
# Log-transform the 18 flagged variables; the +0.1 offset (or +2 where
# the minimum is -1) keeps the log defined at zero counts

df['log_n_tokens_content'] = np.log(df[' n_tokens_content'] + 0.1)
df['log_n_unique_tokens'] = np.log(df[' n_unique_tokens'] + 0.1) 
df['log_n_non_stop_words'] = np.log(df[' n_non_stop_words'] + 0.1)
df['log_n_non_stop_unique_tokens'] = np.log(df[' n_non_stop_unique_tokens'] + 0.1)

df['log_num_hrefs'] = np.log(df[' num_hrefs'] + 0.1)
df['log_num_self_hrefs'] = np.log(df[' num_self_hrefs'] + 0.1)
df['log_num_imgs'] = np.log(df[' num_imgs'] + 0.1)
df['log_num_videos'] = np.log(df[' num_videos'] + 0.1)

df['log_kw_min_min'] = np.log(df[' kw_min_min'] + 2)
df['log_kw_max_min'] = np.log(df[' kw_max_min'] + 0.1)
df['log_kw_avg_min'] = np.log(df[' kw_avg_min'] + 2)

df['log_kw_min_max'] = np.log(df[' kw_min_max'] + 0.1)

df['log_kw_max_avg'] = np.log(df[' kw_max_avg'] + 0.1)
df['log_kw_avg_avg'] = np.log(df[' kw_avg_avg'] + 0.1)

df['log_self_reference_min_shares'] = np.log(df[' self_reference_min_shares'] + 0.1)
df['log_self_reference_max_shares'] = np.log(df[' self_reference_max_shares'] + 0.1)
df['log_self_reference_avg_sharess'] = np.log(df[' self_reference_avg_sharess'] + 0.1)

df['log_shares'] = np.log(df[' shares'] + 0.1)
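The constant offset makes the transform exactly invertible, which matters when mapping model predictions back to raw share counts; a quick sketch:

```python
import numpy as np

shares = np.array([1.0, 593.0, 690400.0])
log_shares = np.log(shares + 0.1)       # same +0.1 offset used above
recovered = np.exp(log_shares) - 0.1    # back-transform
# The offset keeps log defined at zero and round-trips exactly
# (up to floating point), so log-scale predictions stay interpretable.
```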
In [ ]:

In [13]:
# Find column locations of the untransformed originals

df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 18, 19, 20, 21, 22, 44]]
Out[13]:
Index([' n_tokens_content', ' n_unique_tokens', ' n_non_stop_words',
       ' n_non_stop_unique_tokens', ' num_hrefs', ' num_self_hrefs',
       ' num_imgs', ' num_videos', ' kw_min_min', ' kw_max_min', ' kw_avg_min',
       ' kw_min_max', ' kw_max_avg', ' kw_avg_avg',
       ' self_reference_min_shares', ' self_reference_max_shares',
       ' self_reference_avg_sharess', ' shares'],
      dtype='object')
In [14]:
# Drop the untransformed originals listed above

df.drop(df.columns[[1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 18, 19, 20, 21, 22, 44]], axis=1, inplace=True)
In [15]:
# Check dtypes to confirm the transforms so far
df.dtypes
Out[15]:
 n_tokens_title                   float64
 average_token_length             float64
 num_keywords                     float64
 kw_max_max                       float64
 kw_avg_max                       float64
 kw_min_avg                       float64
 LDA_00                           float64
 LDA_01                           float64
 LDA_02                           float64
 LDA_03                           float64
 LDA_04                           float64
 global_subjectivity              float64
 global_sentiment_polarity        float64
 global_rate_positive_words       float64
 global_rate_negative_words       float64
 rate_positive_words              float64
 rate_negative_words              float64
 avg_positive_polarity            float64
 min_positive_polarity            float64
 max_positive_polarity            float64
 avg_negative_polarity            float64
 min_negative_polarity            float64
 max_negative_polarity            float64
 title_subjectivity               float64
 title_sentiment_polarity         float64
 abs_title_subjectivity           float64
 abs_title_sentiment_polarity     float64
 Channel                           object
 weekday                           object
log_n_tokens_content              float64
log_n_unique_tokens               float64
log_n_non_stop_words              float64
log_n_non_stop_unique_tokens      float64
log_num_hrefs                     float64
log_num_self_hrefs                float64
log_num_imgs                      float64
log_num_videos                    float64
log_kw_min_min                    float64
log_kw_max_min                    float64
log_kw_avg_min                    float64
log_kw_min_max                    float64
log_kw_max_avg                    float64
log_kw_avg_avg                    float64
log_self_reference_min_shares     float64
log_self_reference_max_shares     float64
log_self_reference_avg_sharess    float64
log_shares                        float64
dtype: object
In [16]:
# Linear regression analysis. This part will be moved to section 9; it is
# shown here to illustrate how the interesting variables were selected.

class_y = df.log_shares
class_X = df.drop(['log_shares', ' Channel', ' weekday'], axis=1) # axis = 1 -  column

import statsmodels.api as sm
class_X = sm.add_constant(class_X)
ls_model = sm.OLS(class_y.astype(float), class_X.astype(float)).fit()
ls_model.summary()
Out[16]:
OLS Regression Results
Dep. Variable: log_shares R-squared: 0.101
Model: OLS Adj. R-squared: 0.099
Method: Least Squares F-statistic: 85.00
Date: Thu, 13 Sep 2018 Prob (F-statistic): 0.00
Time: 17:12:12 Log-Likelihood: -41360.
No. Observations: 33510 AIC: 8.281e+04
Df Residuals: 33465 BIC: 8.319e+04
Df Model: 44
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.2988 1.665 4.383 0.000 4.035 10.563
n_tokens_title -0.0004 0.002 -0.177 0.859 -0.005 0.004
average_token_length -0.0669 0.021 -3.220 0.001 -0.108 -0.026
num_keywords 0.0244 0.003 8.104 0.000 0.018 0.030
kw_max_max -1.765e-07 4.13e-08 -4.275 0.000 -2.57e-07 -9.56e-08
kw_avg_max -2.616e-08 6.48e-08 -0.404 0.686 -1.53e-07 1.01e-07
kw_min_avg 0.0003 1.19e-05 25.954 0.000 0.000 0.000
LDA_00 -1.4414 2.079 -0.693 0.488 -5.517 2.634
LDA_01 -1.8100 2.079 -0.871 0.384 -5.885 2.265
LDA_02 -1.7942 2.080 -0.863 0.388 -5.871 2.282
LDA_03 -1.7301 2.079 -0.832 0.405 -5.805 2.345
LDA_04 -1.4611 2.079 -0.703 0.482 -5.536 2.614
global_subjectivity 0.4215 0.070 6.048 0.000 0.285 0.558
global_sentiment_polarity -0.0956 0.143 -0.668 0.504 -0.376 0.185
global_rate_positive_words 0.2555 0.603 0.423 0.672 -0.927 1.438
global_rate_negative_words 0.8708 1.186 0.734 0.463 -1.453 3.195
rate_positive_words 0.5442 0.483 1.128 0.259 -0.402 1.490
rate_negative_words 0.3832 0.486 0.788 0.431 -0.570 1.336
avg_positive_polarity -0.0585 0.111 -0.525 0.599 -0.277 0.160
min_positive_polarity -0.2368 0.097 -2.451 0.014 -0.426 -0.047
max_positive_polarity -0.0281 0.034 -0.820 0.412 -0.095 0.039
avg_negative_polarity -0.0724 0.099 -0.729 0.466 -0.267 0.122
min_negative_polarity -0.0385 0.036 -1.076 0.282 -0.109 0.032
max_negative_polarity 0.0497 0.087 0.573 0.567 -0.120 0.220
title_subjectivity 0.0482 0.022 2.169 0.030 0.005 0.092
title_sentiment_polarity 0.0671 0.021 3.209 0.001 0.026 0.108
abs_title_subjectivity 0.1272 0.029 4.315 0.000 0.069 0.185
abs_title_sentiment_polarity 0.0503 0.032 1.558 0.119 -0.013 0.114
log_n_tokens_content -0.0943 0.022 -4.346 0.000 -0.137 -0.052
log_n_unique_tokens -0.6368 0.141 -4.502 0.000 -0.914 -0.360
log_n_non_stop_words 0.1798 0.228 0.789 0.430 -0.267 0.627
log_n_non_stop_unique_tokens 0.2569 0.113 2.282 0.022 0.036 0.477
log_num_hrefs 0.0817 0.007 11.102 0.000 0.067 0.096
log_num_self_hrefs -0.0332 0.006 -5.153 0.000 -0.046 -0.021
log_num_imgs 0.0166 0.004 4.090 0.000 0.009 0.025
log_num_videos 0.0411 0.004 11.374 0.000 0.034 0.048
log_kw_min_min -0.0075 0.005 -1.437 0.151 -0.018 0.003
log_kw_max_min -0.0601 0.010 -5.788 0.000 -0.080 -0.040
log_kw_avg_min 0.1060 0.015 6.846 0.000 0.076 0.136
log_kw_min_max -0.0514 0.002 -23.299 0.000 -0.056 -0.047
log_kw_max_avg 0.0228 0.022 1.024 0.306 -0.021 0.066
log_kw_avg_avg 0.1115 0.027 4.192 0.000 0.059 0.164
log_self_reference_min_shares -0.0559 0.011 -5.243 0.000 -0.077 -0.035
log_self_reference_max_shares -0.1583 0.024 -6.549 0.000 -0.206 -0.111
log_self_reference_avg_sharess 0.2407 0.033 7.318 0.000 0.176 0.305
Omnibus: 6140.349 Durbin-Watson: 1.906
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17302.005
Skew: 0.978 Prob(JB): 0.00
Kurtosis: 5.927 Cond. No. 9.01e+08


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.01e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

Note the warning above: the large condition number (9.01e+08) indicates strong multicollinearity.

We keep 20 candidate variables from the model above and remove problem variables based on a pairplot (not shown):

average_token_length, num_keywords, kw_max_max, kw_min_avg, global_subjectivity, title_sentiment_polarity, abs_title_subjectivity, log_n_tokens_content, log_n_unique_tokens, log_num_hrefs, log_num_self_hrefs, log_num_imgs, log_num_videos, log_kw_max_min, log_kw_avg_min, log_kw_min_max, log_kw_avg_avg, log_self_reference_min_shares, log_self_reference_max_shares, log_self_reference_avg_sharess
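A common way to quantify the multicollinearity flagged by the large condition number is the variance inflation factor. A minimal numpy sketch on toy columns (the `vif` helper and toy design are ours; statsmodels provides the same computation as `variance_inflation_factor`):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of X:
    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    (standardized) column j on the remaining columns."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Toy design: b is nearly a copy of a, c is independent
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a + 0.01 * rng.normal(size=200)
c = rng.normal(size=200)
vifs = vif(np.column_stack([a, b, c]))
```

VIFs above roughly 10 are the usual red flag and point at the columns to drop or combine.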

In [17]:
# Remove some variables after checking the pairplot; keep 12 predictors plus the target 'log_shares'.

df_clean = df[[' average_token_length', ' num_keywords', ' global_subjectivity',' title_sentiment_polarity',
               ' abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs',
               'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess', 'log_shares']]

# Recheck linear regression analysis.

clean_y = df_clean.log_shares
clean_X = df_clean.drop(['log_shares'], axis=1) # axis = 1 -  column

import statsmodels.api as sm
clean_X = sm.add_constant(clean_X)
clean_ls_model = sm.OLS(clean_y.astype(float), clean_X.astype(float)).fit()
clean_ls_model.summary()
Out[17]:
OLS Regression Results
Dep. Variable: log_shares R-squared: 0.038
Model: OLS Adj. R-squared: 0.037
Method: Least Squares F-statistic: 108.9
Date: Thu, 13 Sep 2018 Prob (F-statistic): 3.99e-267
Time: 17:12:18 Log-Likelihood: -42494.
No. Observations: 33510 AIC: 8.501e+04
Df Residuals: 33497 BIC: 8.512e+04
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 7.3684 0.082 89.445 0.000 7.207 7.530
average_token_length -0.1713 0.018 -9.536 0.000 -0.207 -0.136
num_keywords 0.0221 0.002 8.945 0.000 0.017 0.027
global_subjectivity 0.8297 0.059 14.108 0.000 0.714 0.945
title_sentiment_polarity 0.1496 0.019 7.737 0.000 0.112 0.188
abs_title_subjectivity 0.0832 0.026 3.218 0.001 0.033 0.134
log_n_tokens_content -0.0130 0.007 -1.775 0.076 -0.027 0.001
log_n_unique_tokens -0.0737 0.033 -2.265 0.024 -0.138 -0.010
log_num_hrefs 0.0952 0.007 13.577 0.000 0.081 0.109
log_num_self_hrefs -0.0490 0.006 -8.416 0.000 -0.060 -0.038
log_num_imgs 0.0084 0.004 2.161 0.031 0.001 0.016
log_num_videos 0.0230 0.003 6.718 0.000 0.016 0.030
log_self_reference_avg_sharess 0.0300 0.002 16.188 0.000 0.026 0.034
Omnibus: 5596.895 Durbin-Watson: 1.841
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13944.291
Skew: 0.936 Prob(JB): 0.00
Kurtosis: 5.546 Cond. No. 250.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

!!! No multicollinearity warning now (Cond. No. dropped to 250)

04. Simple Statistics (10)

  • Visualize appropriate statistics (e.g., range, mode, mean, median, variance, counts) for a subset of attributes. Describe anything meaningful you found from this or if you found something potentially interesting. Note: You can also use data from other sources for comparison. Explain why the statistics run are meaningful. *
In [18]:
# Quick statistic summary of the data
df.describe().transpose()
Out[18]:
count mean std min 25% 50% 75% max
n_tokens_title 33510.0 10.416204 2.134557 2.000000 9.000000 10.000000 12.000000 23.000000
average_token_length 33510.0 4.607736 0.646702 0.000000 4.490385 4.669471 4.852841 7.695652
num_keywords 33510.0 7.177798 1.952422 1.000000 6.000000 7.000000 9.000000 10.000000
kw_max_max 33510.0 753337.156073 213034.725810 0.000000 843300.000000 843300.000000 843300.000000 843300.000000
kw_avg_max 33510.0 241773.119564 122864.851517 0.000000 165789.285715 228286.666666 310000.000000 843300.000000
kw_min_avg 33510.0 1031.643292 1067.121534 -1.000000 0.000000 956.466667 1936.365694 3610.124972
LDA_00 33510.0 0.207364 0.277868 0.000000 0.025188 0.040000 0.320122 0.926994
LDA_01 33510.0 0.141303 0.224786 0.000000 0.025005 0.033339 0.134308 0.925947
LDA_02 33510.0 0.245294 0.295739 0.000000 0.028573 0.050001 0.432298 0.919999
LDA_03 33510.0 0.140653 0.224266 0.000000 0.025033 0.033373 0.136139 0.925542
LDA_04 33510.0 0.265356 0.301748 0.000000 0.029340 0.050763 0.485458 0.927191
global_subjectivity 33510.0 0.439569 0.099129 0.000000 0.393133 0.447111 0.497383 1.000000
global_sentiment_polarity 33510.0 0.119596 0.090615 -0.377657 0.062231 0.119719 0.175602 0.727841
global_rate_positive_words 33510.0 0.039891 0.016444 0.000000 0.028889 0.039164 0.050205 0.155488
global_rate_negative_words 33510.0 0.016320 0.009943 0.000000 0.009768 0.015209 0.021277 0.139831
rate_positive_words 33510.0 0.695364 0.171064 0.000000 0.611111 0.714286 0.800000 1.000000
rate_negative_words 33510.0 0.288462 0.150604 0.000000 0.188679 0.277778 0.379310 1.000000
avg_positive_polarity 33510.0 0.351704 0.091821 0.000000 0.305024 0.354545 0.403922 1.000000
min_positive_polarity 33510.0 0.091221 0.064398 0.000000 0.050000 0.100000 0.100000 1.000000
max_positive_polarity 33510.0 0.762561 0.231811 0.000000 0.600000 0.800000 1.000000 1.000000
avg_negative_polarity 33510.0 -0.255202 0.117386 -1.000000 -0.318403 -0.250000 -0.186667 0.000000
min_negative_polarity 33510.0 -0.524040 0.283454 -1.000000 -0.700000 -0.500000 -0.300000 0.000000
max_negative_polarity 33510.0 -0.104041 0.086693 -1.000000 -0.125000 -0.100000 -0.050000 0.000000
title_subjectivity 33510.0 0.265740 0.314245 0.000000 0.000000 0.100000 0.500000 1.000000
title_sentiment_polarity 33510.0 0.068661 0.252802 -1.000000 0.000000 0.000000 0.136364 1.000000
abs_title_subjectivity 33510.0 0.343692 0.188397 0.000000 0.166667 0.500000 0.500000 0.500000
abs_title_sentiment_polarity 33510.0 0.145570 0.217789 0.000000 0.000000 0.000000 0.225000 1.000000
log_n_tokens_content 33510.0 5.999449 1.273157 -2.302585 5.606170 6.102782 6.634765 9.044770
log_n_unique_tokens 33510.0 -0.491878 0.282137 -2.302585 -0.566697 -0.457923 -0.356675 6.552651
log_n_non_stop_words 33510.0 0.057017 0.303731 -2.302585 0.095310 0.095310 0.095310 6.948993
log_n_non_stop_unique_tokens 33510.0 -0.275653 0.291904 -2.302585 -0.316947 -0.235202 -0.159611 6.477126
log_num_hrefs 33510.0 1.964819 0.986069 -2.302585 1.410987 1.960095 2.572612 5.717357
log_num_self_hrefs 33510.0 0.645982 1.345538 -2.302585 0.095310 1.131402 1.410987 4.754452
log_num_imgs 33510.0 0.366843 1.456365 -2.302585 0.095310 0.095310 1.131402 4.852811
log_num_videos 33510.0 -1.349393 1.432197 -2.302585 -2.302585 -2.302585 0.095310 4.318821
log_kw_min_min 33510.0 1.155246 1.721834 0.000000 0.000000 0.000000 1.791759 5.937536
log_kw_max_min 33510.0 6.312276 1.624741 -2.302585 6.096050 6.486313 6.907855 12.606190
log_kw_avg_min 33510.0 5.291667 1.158304 0.000000 4.963019 5.477788 5.892827 10.664991
log_kw_min_max 33510.0 3.853464 5.606529 -2.302585 -2.302585 7.170196 8.824693 13.645078
log_kw_max_avg 33510.0 8.401823 0.641385 -2.302585 8.161793 8.305153 8.584248 12.606190
log_kw_avg_avg 33510.0 7.901533 0.560973 -2.302585 7.746331 7.914672 8.093026 10.682073
log_self_reference_min_shares 33510.0 5.702312 3.891489 -2.302585 6.429881 7.003156 7.783266 13.445027
log_self_reference_max_shares 33510.0 6.438064 4.257466 -2.302585 6.907855 7.901044 8.895643 13.638415
log_self_reference_avg_sharess 33510.0 6.179735 4.104882 -2.302585 6.863995 7.649740 8.444644 13.445027
log_shares 33510.0 7.412123 0.876603 0.095310 6.835292 7.244299 7.824086 13.445027
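`describe()` covers count, mean, std, and the quartiles, but omits the mode and range named in the rubric. A sketch of how those could be tabulated alongside variance and skew; `df_demo` and its values are made-up illustrations, not the real data:

```python
import pandas as pd

# Toy frame standing in for the numeric columns of the dataset
df_demo = pd.DataFrame({
    "num_keywords": [6, 7, 7, 9, 10],
    "n_tokens_title": [9, 10, 10, 12, 23],
})

extra = pd.DataFrame({
    "mode": df_demo.mode().iloc[0],          # most frequent value per column
    "range": df_demo.max() - df_demo.min(),  # max minus min
    "variance": df_demo.var(),
    "skew": df_demo.skew(),
})
print(extra)
```

Running the same four aggregations over `df[numeric]` would complement the `describe()` table above.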

05. Visualize Attributes (15)

  • Visualize the most interesting attributes (at least 5 attributes, your opinion on what is interesting). Important: Interpret the implications for each visualization. Explain for each attribute why the chosen visualization is appropriate.*
In [19]:
# Histograms of the 12 cleaned predictors plus the target log_shares

df_clean.hist(figsize=(12,12))
Out[19]:
4×4 array of matplotlib AxesSubplot objects (histogram grid for the 12 predictors and log_shares)

06. Explore Joint Attributes (15)

  • Visualize relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate. Explain any interesting relationships.*
In [20]:
# Pairplot for log transformed variables, as grouped by Channel

sns.pairplot(df, vars=[' average_token_length', ' num_keywords', ' global_subjectivity',' title_sentiment_polarity',
                       ' abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs', 
                       'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess'], hue=" Channel", palette="husl", size=3)
Out[20]:
<seaborn.axisgrid.PairGrid at 0x1c26df3a58>
In [21]:
# Pairplot for log transformed variables, as grouped by Weekday

sns.pairplot(df, vars=[' average_token_length', ' num_keywords', ' global_subjectivity',' title_sentiment_polarity',
                       ' abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs', 
                       'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess'], hue=" weekday", palette="husl", size=2)
Out[21]:
<seaborn.axisgrid.PairGrid at 0x1c2baa34a8>

07. Explore Attributes and Class (10)

  • Identify and explain interesting relationships between features and the class you are trying to predict (i.e., relationships with variables and the target classification).*

08. New Features (5)

  • Are there other features that could be added to the data or created from existing features? Which ones? *

09. Exceptional Work (10)

  • You have free rein to provide additional analyses. One idea: implement dimensionality reduction, then visualize and interpret the results.*
In [22]:
numeric = [c for i,c in enumerate(df.columns) if df.dtypes[i] in [np.float64, np.int64]]
len(numeric)

# Correlation 
cmap = sns.diverging_palette(255, 133, l=60, n=7, as_cmap=True, center="dark")
sns.clustermap(df[numeric].corr(), figsize=(14, 14), cmap=cmap);
In [23]:
# plot the correlation matrix using seaborn
sns.set(style="darkgrid") # one of the many styles to plot using

fig, ax = plt.subplots(figsize=(10, 10))

sns.heatmap(df.corr(), cmap="BuPu", ax=ax)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x10433c630>
In [24]:
# Cut log_shares into 2 equal-sized groups at the median

df['log_shares_cut'] = pd.qcut(df['log_shares'], 2, labels=('unpopular', 'popular'))
In [ ]:
# df['log_shares_cut'] = pd.qcut(df['log_shares'], 3, labels = False)

# pd.qcut(range(5), 3, labels=["good","medium","bad"])
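For reference, `pd.qcut` splits at sample quantiles, so the two bins come out (near) equal-sized regardless of how skewed log_shares is. A toy illustration with made-up share values:

```python
import pandas as pd

shares = pd.Series([10, 20, 30, 40, 50, 60])
# q=2 splits at the median (35): lower half -> 'unpopular', upper -> 'popular'
cut = pd.qcut(shares, 2, labels=("unpopular", "popular"))
print(cut.value_counts())
```

This is why the 'unpopular'/'popular' classes used below are balanced by construction.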
In [25]:
# Get 'log_shares' position
df.columns.get_loc('log_shares')
Out[25]:
46
In [26]:
# Drop the 'log_shares' column by name (safer than the positional index found above)
df.drop(columns=['log_shares'], inplace=True)
In [27]:
# Samples for pairplot as group by the log_share_cut (0, 1)

sns.pairplot(df, vars = [' average_token_length', ' num_keywords', ' global_subjectivity',' title_sentiment_polarity',
                       ' abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs', 
                       'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess'], hue = "log_shares_cut", palette="husl", size=2)
Out[27]:
<seaborn.axisgrid.PairGrid at 0x1c364b0f60>
In [28]:
# Pick log transformed variables, transform and prepare for PCA 

from sklearn.preprocessing import StandardScaler
features = [' average_token_length', ' num_keywords', ' global_subjectivity',' title_sentiment_polarity',
            ' abs_title_subjectivity', 'log_n_tokens_content', 'log_n_unique_tokens','log_num_hrefs',
            'log_num_self_hrefs','log_num_imgs', 'log_num_videos', 'log_self_reference_avg_sharess']
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['log_shares_cut']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)
In [29]:
# Try PCA

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1', 'principal component 2'])
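Before reading the 2-D scatter below, it is worth checking how much variance the two components actually retain; the fitted `pca` object exposes this as `explained_variance_ratio_`. A self-contained sketch on synthetic standardized data (the synthetic matrix is illustrative only, not the notebook's predictors):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# 500 samples, 12 correlated features driven by 3 latent factors plus noise,
# mimicking a standardized predictor matrix
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 12))
X = StandardScaler().fit_transform(latent @ mixing + 0.1 * rng.normal(size=(500, 12)))

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)        # per-component share of variance
print(pca.explained_variance_ratio_.sum())  # total variance retained in 2-D
```

If the retained share is low, the 2-D scatter understates the separation (or lack of it) between the two popularity classes.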
In [30]:
# Concat two component and prepare to plot

finalDf = pd.concat([principalDf, df[['log_shares_cut']]], axis = 1)
finalDf.head(10)
Out[30]:
principal component 1 principal component 2 log_shares_cut
0 -1.684700 -1.913425 unpopular
1 1.010819 2.862638 unpopular
2 0.416002 0.751330 popular
3 1.469637 2.239519 unpopular
4 -0.155334 1.734597 unpopular
5 -0.315230 -0.641018 unpopular
6 -1.780766 -1.912482 unpopular
7 0.893568 2.773147 unpopular
8 0.090845 -1.091994 popular
9 0.644319 2.873947 unpopular
In [31]:
# Plot 2 component PCA

fig = plt.figure(figsize = (6,6))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
log_shares_cuts = ['unpopular', 'popular'] # category labels assigned by pd.qcut above
colors = ['r', 'b']

for log_shares_cut, color in zip(log_shares_cuts, colors):
    indicesToKeep = finalDf['log_shares_cut'] == log_shares_cut
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
               , finalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 15)
ax.legend(log_shares_cuts)
ax.grid()